Accessor Variety Criteria for Chinese Word Extraction

نویسندگان

  • Haodi Feng
  • Kang Chen
  • Xiaotie Deng
  • Weimin Zheng
چکیده

We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, , ‘percent’, and , ‘more and more’, are not recognized as traditional Chinese words from the viewpoint of some people. However, in our work, they are words because they are very widely used and have specific meanings. We start with the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that are directly before a string (predecessors) and the characters that are directly after a string (successors) as important factors for determining the independence of the string. We call such characters accessors of the string, consider the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents), and use them as the measurement of the context independency of a string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Extract Chinese Unknown Words from a Large-scale Corpus Using Morphological and Distributional Evidences

The representative method of using morphological evidence for Chinese unknown word (UW) extraction is Chinese word segmentation (CWS) model, and the method of using distributional evidence for UW extraction is accessor variety (AV) criterion. However, neither of these methods has been verified on large-scale corpus. In this paper, we propose extensions to remedy the drawbacks of these two metho...

متن کامل

High OOV-Recall Chinese Word Segmenter

For the competition of Chinese word segmentation held in the first CIPS-SIGHNA joint conference. We applied a subwordbased word segmenter using CRFs and extended the segmenter with OOV words recognized by Accessor Variety. Moreover, we proposed several post-processing rules to improve the performance. Our system achieved promising OOV recall among all the participants.

متن کامل

Normalized Accessor Variety Combined with Conditional Random Fields in Chinese Word Segmentation

The word is the basic unit in natural language processing (NLP), as it is at the lexical level upon which further processing rests. The lack of word delimiters such as spaces in Chinese texts makes Chinese word segmentation (CWS) an interesting while challenging issue. This paper describes the in-depth research following our participation in the fourth International Chinese Language Processing ...

متن کامل

English-to-Chinese Machine Transliteration using Accessor Variety Features of Source Graphemes

This work presents a grapheme-based approach of English-to-Chinese (E2C) transliteration, which consists of many-to-many (M2M) alignment and conditional random fields (CRF) using accessor variety (AV) as an additional feature to approximate local context of source graphemes. Experiment results show that the AV of a given English named entity generally improves effectiveness of E2C transliteration.

متن کامل

Improved Automatic Keyword Extraction Based on TextRank Using Domain Knowledge

Keyword extraction of scientific articles is beneficial for retrieving scientific articles of a certain topic and grasping the trend of academic development. For the task of keyword extraction for Chinese scientific articles, we adopt the framework of selecting keyword candidates by Document Frequency Accessor Variety(DF-AV) and running TextRank algorithm on a phrase network. To improve domain ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Computational Linguistics

دوره 30  شماره 

صفحات  -

تاریخ انتشار 2004